Path Sensitive Program Analysis Using Boolean Satisfiability
Abstract
Recent advances in boolean satisfiability (SAT) solvers have made it possible to solve structured formulas on the order of a million variables. In this work we show how to efficiently transform program analysis problems directly into SAT instances. The translation is similar to generating verification conditions, but we avoid exponential growth by introducing temporary variables in place of using substitution. Unlike most approaches based on general purpose theorem provers, our transformation deals with modular arithmetic and bit operations in a natural and efficient manner.

Our goal is to use SAT to solve some of the more difficult problems in applying dataflow techniques to automatically finding bugs in programs. We separate this task into two parts. First, an input program in a conventional programming language is filtered through a program abstractor. The output of the abstractor is a program in a simplified imperative programming language, with some properties annotated with assertions. Next, we translate this simplified language into SAT to check these assertions. We apply this technique to checking state machine properties, determining path feasibility, and detecting buffer overruns.

1. MOTIVATION

Static analysis has been shown to be very effective at finding errors in large systems. A typical approach is to specify an invariant property of an interface as an abstract state machine. Once a property is specified, a program abstractor determines the (possibly null) effect of each statement on the abstract state machine. This formulation leads naturally to an inter-procedural dataflow analysis which determines the set of reachable states for each program point. The tool reports an error for each program point that can be reached in an error state. One problem with this approach is that false paths caused by correlated branches can lead to spurious error reports.
We take a demand-driven approach to this problem. First, a relatively cheap dataflow analysis like that performed by MC [14] or ESP [9] is performed to determine if any errors are possible. If no errors are possible, no further steps are necessary. If an error is reported, we translate the statements and control flow of the offending procedure into a boolean satisfiability (SAT) formula that is satisfiable if the error is reachable. Recent advances in SAT solvers have made it possible to solve instances with millions of variables [18, 22], and our preliminary results show that instances derived from small to moderate sized programs can be handled by these solvers.

One benefit of this approach is flexibility. It will be sound if the effect of each statement is modeled conservatively by the program abstractor. On the other hand, a sound approach may require very conservative assumptions, resulting in too many false error reports for certain programs or properties. This approach allows us to trade off soundness for usefulness. For example, the program abstractor can take advantage of ad-hoc conventions or domain-specific knowledge to avoid spurious error reports in a potentially unsound way.

In our basic approach, each value is precisely modeled at the bit level by a SAT formula, so the analysis can handle low-level features such as modular arithmetic and bitwise operations.
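As a rough illustration of what bit-level modeling can look like, the Python sketch below builds a ripple-carry adder and a bitwise and as boolean formulas over one variable per bit, using only ¬, ∧ and ∨. It is not taken from the paper: the names `add_formulas`, `band_formulas`, `evaluate` and the 4-bit width are assumptions made for the example, and it uses plain substitution rather than temporary variables, so it only suits very small widths.

```python
from itertools import product

def xor(p, q):
    # p xor q written with the basic operators only
    return ("or", ("and", p, ("not", q)), ("and", ("not", p), q))

def add_formulas(a, b):
    """Ripple-carry adder over lists of formulas, least significant bit first.
    The final carry is dropped, which gives addition modulo 2**n."""
    out, carry = [], None
    for x, y in zip(a, b):
        half = xor(x, y)
        if carry is None:            # bit 0 has no incoming carry
            out.append(half)
            carry = ("and", x, y)
        else:
            out.append(xor(half, carry))
            carry = ("or", ("and", x, y), ("and", carry, half))
    return out

def band_formulas(a, b):
    # bitwise and is a per-bit conjunction
    return [("and", x, y) for x, y in zip(a, b)]

def evaluate(f, env):
    """Evaluate a formula under an assignment of booleans to variable names."""
    if isinstance(f, str):
        return env[f]
    if f[0] == "not":
        return not evaluate(f[1], env)
    if f[0] == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if f[0] == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {f[0]}")

def to_int(bits):
    return sum(1 << i for i, bit in enumerate(bits) if bit)

N = 4                                  # hypothetical word size for the demo
A = [f"a{i}" for i in range(N)]
B = [f"b{i}" for i in range(N)]
SUM = add_formulas(A, B)
BAND = band_formulas(A, B)

def check_adder():
    """Compare the formulas with machine arithmetic on every input pair."""
    for values in product([False, True], repeat=2 * N):
        env = dict(zip(A + B, values))
        a, b = to_int(values[:N]), to_int(values[N:])
        if to_int([evaluate(f, env) for f in SUM]) != (a + b) % (1 << N):
            return False
        if to_int([evaluate(f, env) for f in BAND]) != (a & b):
            return False
    return True
```

Calling `check_adder()` compares the formulas against ordinary integer arithmetic for all 256 input pairs, including the cases where the sum wraps around.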
Thus, we believe that our approach will be applicable to detecting integer overflows, off-by-one errors, and potentially buffer overflows, in addition to finite state machine properties.

2. TRANSLATION RULES

In this section, we describe the translation of programs from a small imperative language called Simple into boolean formulas. The translation outlined in this section is similar to the verification condition generation algorithm given by Flanagan and Saxe [13], except that the result is a boolean formula instead of a formula containing more complex operations on variables. A boolean formula f is described by the grammar

    f ::= v | ¬f | f ∧ f | f ∨ f

where v ∈ Vars and Vars is a set of boolean variables. Boolean operators such as →, =, and ⊕ (xor) can be reduced to the basic operators, and we will use them freely when building boolean formulas.

The translation from Simple to boolean formulas is done in two steps. In the first step, we simplify the source program by conservatively transforming while loops into if statements and renaming variables so that each variable is assigned only once along any path. This results in a simpler program in a simpler language which we call, unsurprisingly, Simpler. In the second step, we translate the simplified program into boolean formulas.

    S ::= skip | S; S | V ← E | if (B) S else S | while (B) S | assume(B)
    B ::= true | false | ¬B | B ∧ B | B ∨ B | E comp E
    E ::= V | C | unknown | uop E | E binop E

Figure 1: Abstract syntax for Simple, a small imperative language. V represents variables, and C represents constants. In this paper uop ∈ {−}, binop ∈ {+, −, band}, where band is bitwise and, and comp ∈ {<, =, ≠}.

2.1 Source Language

Figure 1 gives the definition of Simple, the source language of the translation. Simple is a small imperative language with assignments, if statements, and while loops.
Expressions are side-effect free; we handle a variety of operators ranging from modular addition and subtraction to bitwise and, or, and negation. The type system is trivial: values are either boolean (bool) or n-bit integers (int). All variables have type int. The expression unknown is a nondeterministic int value, which is used to model loops and unsupported operations. We use {· · ·} to group statements and (· · ·) to group expressions when it helps clarity.

2.2 Phase I: Handling Assignments and Loops

In an imperative language like C, variables can be, and usually are, assigned multiple times during their lifespan. Therefore they can take on different values at different program points. But in a boolean formula, each boolean variable can only take on one value. To bridge this gap, we borrow an idea from the SSA transformation [8]: we rename variables so that they are assigned only once on each path and therefore can only take on one value on each path.

In order to rename variables at assignments, we use a symbol table. The symbol table maps variables in the program to fresh variables not in the original program. The symbol table operations we use are:

• T(v) looks up the variable v in the symbol table T and returns the result.
• Rename(T, V) takes a table T and a set of variables V from the original program and returns a new symbol table with each v ∈ V mapped to a fresh variable not used anywhere else, and each v ∉ V mapped to T(v). If the set is empty, Rename returns the table unchanged.

Finally, we use the functions Def(S) and Ref(S) to denote the set of all variables defined and referenced in statement S and all of its sub-statements.

With these symbol table operations, we can express the rules for the first step of the translation, shown in Figure 2. The rules define three translation functions, one for each syntactic category: $\stackrel{E}{\leadsto}$ for int expressions, $\stackrel{B}{\leadsto}$ for bool expressions, and $\stackrel{S}{\leadsto}$ for statements.
$$\stackrel{E}{\leadsto} \; : \; \mathit{expr}_i \times \mathit{table} \to \mathit{expr}_i$$
$$\stackrel{B}{\leadsto} \; : \; \mathit{expr}_b \times \mathit{table} \to \mathit{expr}_b$$
$$\stackrel{S}{\leadsto} \; : \; \mathit{stmt} \times \mathit{table} \to \mathit{stmt} \times \mathit{table}$$

The function $\stackrel{E}{\leadsto}$ takes an int expression and a table and recursively replaces each variable in the expression with the variable it maps to in the table. The function $\stackrel{B}{\leadsto}$ does the same thing for bool expressions. The function $\stackrel{S}{\leadsto}$ takes a statement and a table and results in a new statement where all variables assume only one value on any path and while loops are desugared. The function $\stackrel{S}{\leadsto}$ also returns a new table with the current symbol table mapping at the end of the statement. Given a program $P$ which consists of a single statement, we can derive the transformed program $P'$ and a final table $T$ as:

$$\langle P, \mathrm{Rename}(\{\}, \mathrm{Ref}(P) \cup \mathrm{Def}(P)) \rangle \stackrel{S}{\leadsto} \langle P', T \rangle$$

where $\mathrm{Rename}(\{\}, \mathrm{Ref}(P) \cup \mathrm{Def}(P))$ is just a fresh table mapping every variable in the program to a fresh variable.

The core rules for the first step of the transformation are Assign, Seq, If, and While. We illustrate how these rules work on a few examples in Figure 3. Figure 3a illustrates how the rule for Assign renames a variable. It also illustrates how the rule for If introduces variables to unify the value after taking either of the branches. The introduced variables encode the intuition that if the true branch is taken, then the variables after the join point will have values from the true branch, and vice versa for the false branch.

Figure 3b illustrates how the While rule conservatively models the effect of loops. We translate a loop as a conditional branch that either (1) executes the loop body one or more times if the entry condition is true, or (2) never executes if the entry condition is false. We simulate the effect of (1) by assigning all variables defined in the loop body to unknown, and then assuming the loop condition. This sets the variables to arbitrary values that could have caused entry into the loop body.
After the loop body, we assume the negation of the loop condition, forcing the variables to arbitrary values that would cause the loop to exit and thereby simulating the last iteration of the loop. Without loop invariants, this approximation can be imprecise.

We use the function exec for the dynamic semantics of Simple:

    exec : stmt × env → 2^env

For lack of space we omit a formal definition of exec. The exec function takes a statement in Simple and an initial environment mapping variables to values, and returns the set of possible final environments after executing the statement. The function is undefined if the statement loops forever; therefore exec is a partial function. We assume that the initial environment gives initial values to all variables used in the program. The return value is a set of environments because exec is nondeterministic; the nondeterminism is introduced by the unknown expression, which is used to model the nondeterministic choices of values in desugared while loops.

Also, we will use the notation T[η] to mean η ◦ T. That is, if η(v) = a and T(v) = v_i, then T[η] maps v_i ↦ a.
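The Phase I rules themselves appear in Figure 2, which is not reproduced in this text, so the following Python sketch is only a reconstruction from the prose description above and should not be read as the paper's exact rules. The tuple-based AST, the helper names (`rename`, `subst`, `defs`, `translate`) and the fresh-name scheme `x_0, x_1, ...` are all assumptions made for illustration.

```python
import itertools
from functools import reduce

_fresh = itertools.count()             # hypothetical source of fresh suffixes

def rename(table, variables):
    """Rename(T, V): map each v in V to a fresh name, keep the rest of T."""
    new = dict(table)
    for v in sorted(variables):
        new[v] = f"{v}_{next(_fresh)}"
    return new

def subst(e, table):
    """Replace every ("var", name) node in an int or bool expression."""
    if not isinstance(e, tuple):
        return e                       # operator symbols, constants, booleans
    if e[0] == "var":
        return ("var", table[e[1]])
    return (e[0],) + tuple(subst(c, table) for c in e[1:])

def defs(s):
    """Def(S): variables assigned in S or any of its sub-statements."""
    if s[0] == "assign":
        return {s[1]}
    if s[0] == "seq":
        return defs(s[1]) | defs(s[2])
    if s[0] == "if":
        return defs(s[2]) | defs(s[3])
    if s[0] == "while":
        return defs(s[2])
    return set()

def seq(stmts):
    return reduce(lambda a, b: ("seq", a, b), stmts) if stmts else ("skip",)

def translate(s, T):
    """Return (renamed statement without while loops, table at its end)."""
    tag = s[0]
    if tag == "skip":
        return s, T
    if tag == "assume":
        return ("assume", subst(s[1], T)), T
    if tag == "assign":
        rhs = subst(s[2], T)           # right-hand side reads the old names
        T2 = rename(T, {s[1]})
        return ("assign", T2[s[1]], rhs), T2
    if tag == "seq":
        s1, T1 = translate(s[1], T)
        s2, T2 = translate(s[2], T1)
        return ("seq", s1, s2), T2
    if tag == "if":
        cond = subst(s[1], T)
        s1, T1 = translate(s[2], T)
        s2, T2 = translate(s[3], T)
        joined = sorted(defs(s[2]) | defs(s[3]))
        T3 = rename(T, joined)         # join variables unify both branches
        j1 = [("assign", T3[v], ("var", T1[v])) for v in joined]
        j2 = [("assign", T3[v], ("var", T2[v])) for v in joined]
        return ("if", cond, seq([s1] + j1), seq([s2] + j2)), T3
    if tag == "while":
        cond, body = s[1], s[2]
        havoc = [("assign", v, ("unknown",)) for v in sorted(defs(body))]
        once = seq(havoc + [("assume", cond), body, ("assume", ("not", cond))])
        return translate(("if", cond, once, ("skip",)), T)
    raise ValueError(f"unknown statement: {tag}")

# Demo: x <- 0; while (x < 10) x <- x + 1
loop = ("while", ("cmp", "<", ("var", "x"), ("const", 10)),
        ("assign", "x", ("binop", "+", ("var", "x"), ("const", 1))))
program = ("seq", ("assign", "x", ("const", 0)), loop)
simpler, final_table = translate(program, rename({}, {"x"}))
```

In this sketch the while loop is first rewritten into the if statement described in the text (havoc the defined variables, assume the condition, run the body once, assume the negated condition) and then handled by the ordinary If rule, which is one possible way to organize the translation.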
Publication date: 2002